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1 INTRODUCTION 


The last two decades have witnessed an explosive growth in computer technology. 
Computer-hardware developments are most noticeable at the two extremes of the spec- 
trum. At one end there are large expensive computer systems, usually referred to as 
supercomputers, such as the CRAY-1 S, CDC^ CYBER 205, and Burroughs Scientific 
Processor (BSP), which have radically different architectures and very high per- 
formance (computational speed of the order of 100 million floating-point operations 
per second (100 MFLOPS) or more). At the other end of the spectrum are the various 
types of minicomputers, desktop computers, microprocessors, and programmable pocket 
calculators. These devices provide reasonable performance at less cost. However, 
their effectiveness is limited for solving large-scale structural problems such as 
those encountered in crash dynamics and large space structures. 

Recently, array processors, which are low-cost, high-performance units used as 
attachments to minicomputers and mainframes, have gained popularity. The question 
arises as to whether the combination of a minicomputer and an array processor in a 
dual-processor system can extend the range of finite-element problems that can be 
solved effectively by the minicomputer. This capability is likely to be accomplished 
through efficient implementation of both the numerical algorithms used in the finite- 
element solution and the communication protocol between the host minicomputer and the 
array processor. The question of synchronization between the different parts of the 
system to ensure correct and efficient function of the minicomputer/array-processor 
system is of paramount importance. Studies have been made to assess the potential of 
minicomputer/array-processor systems for large structural calculations . Potential 
speedups of some of the matrix processors used in the SPAR structural-analysis pro- 
gram are studied in reference 1 . Implementation of finite-element algorithms on a 
minicomputer with an attached array processor is discussed in references 2 and 3. 

In all the cited references, however, estimates are given for speedups of certain 
matrix operations within the finite-element program, but no speedup estimaties are 
given for the entire solution process. Furthermore, no guidelines are presented for 
realizing the full potential of the minicomputer/array-processor system in large- 
scale structural calculations. The present paper focuses on these questions. Spe- 
cifically, the objectives of the present paper are as follows: 

(1) To explore the feasibility and potential of using a minicomputer /array- 

processor system for nonlinear finite-element analysis of large struc- 
tural systems 

(2) To identify the guidelines for computational strategy and programming 

techniques required for realizing this potential 

In this study, a Prime 750 superminicomputer is the host computer, and a soft- 
ware simulator residing on the Prime is employed to assess and measure the perfor- 
mance of the Floating Point Systems (FPS) AP-120B array processor. The selection of 
the AP-120B was prompted by the fact that it is the most widely used array processor 
and the only commercially available one with a simulator on the Prime. In the pres- 
ent paper, emphasis is placed on understanding characteristics of the hardware and 
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interplay among the hardware components, selection of proper computational procedure, 
vectorization of numerical algorithms, minimization of input-output (I/O) operations, 
and overlapping host and array-processor operations. Two numerical examples are 
presented to demonstrate the gain in speed obtained by using the proposed algorithms 
on the minicomputer/array-processor system. Also, new advances in array-processor 
hardware by two commercial vendors are outlined, and possible improvements in compu- 
tational algorithms are identified. The combination of hardware and algorithm 
improvements can significantly enhance the effectiveness of these systems for large- 
scale nonlinear analysis. The methodology developed in this paper is intended to 
guide further efforts on distributed finite -element computations. 
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2 SYMBOLS 

linear-strain-displacement matrix 
nonlinear-strain-displacement vector 
material-stiffness matrix of structure 
Young's modulus of the material 
error norm, measuring error of reduced system 
error norm defined in equation (C7) 

nonlinear-stiffness coefficients of discretized structure 
nonlinear-stiffness coefficients of reduced system 
weighting coefficients for numerical quadrature, where A = 1 to n 
shell thickness 

Jacobian matrix of transformation from local to natural 
(dimensionless) element coordinates 

linear-global-stiffness matrix 

linear-stiffness coefficients of discretized structure 
= ^IJK ^IJKL % 

linear-stiffness coefficients of reduced system 
linear-element-stiffness matrix 
length of shell 

total number of displacement degrees of freedom 
number of numerical quadrature points in element domain 
normalized external-load components 
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Subscripts : 
I, J,K,L 


i, j,k, A 
max 

Superscripts : 
e 


normalized load components of reduced system 
load parameter 

radius of curvature of shell 

number of unknowns of reduced equations 

total strain energy of structure 

contribution of cubic and quartic terms in nodal displacement to 
strain energy 

displacement components in coordinate directions 


= (det[J ] ) ^ where 1=1 to n 
nodal-displacement parameters 
vector of nodal displacements 
orthogonal coordinate system 

condition number of Gram matrix of global-approximation functions 




bK 


i-1 


as defined in equation (3) 


change in a function 

path parameter in solution space 

Poisson’s ratio of the material 

vector of nodal stress resultants 

rotation components 

unknowns of reduced equations 


indices ranging from 1 to N (number of degrees of freedom of full 
system) 

indices ranging from 1 to r (number of degrees of freedom of 
reduced system) 

maximum 


elemental quantity 
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r 


rth iteration cycle (see section 12.2) 


T transposition 

Abbreviations : 


AP 

CP 

CPU 

DCU 

DOF 

FPS 

I/O 

L.H.S. 

MFLOPS 

2-D,3-D,4-D 


AP-120B array processor 
central-processing bound 
central-processing unit 
data-base complex utility 
degrees of freedom 
floating-point system 
input-output operation 
left-hand side 

million floating-point operations per second 
two-, three-, and four-dimensional, respectively 


3 SUMMARY OF MAJOR- HARDWARE AND SYSTEM-SOFTWARE FEATURES OF 
MINICOMPUTER/ARRAY-PROCESSOR SYSTEM 

The minicomputer/array-processor system has a number of features which dis- 
tinguish it from current third-generation (mainframe) hardware. These features pro- 
vide a high degree of interactive capability and attractive computational speeds. 
Although the speed of the system is below those of the large mainframes, its purchase 
cost is considerably less. A schematic drawing of the organization of an assumed 
minicomputer /array-processor system is shown in figure 1 . 

A detailed description of the hardware and software of both the host minicom- 
puter and the array processor is given in the manufacturers* manuals (refs. 4 to 8), 
and their major features, which are exploited in the present study, are summarized 
herein. 


3.1 Prime 750 Host Computer 

The Prime 750 host computer belongs to the class of computers frequently referred 
to as superminis. (See ref. 9 for a classification of hardware systems.) It has a 
32-bit-word (approximately 8 significant digits) central processor and a virtual- 
memory operating system. A basic feature of the Prime 750 is a three-level memory 
organization depicted in figure 2. The three levels are the cache memory, the central 
memory, and the disk system. The Prime 750 supports up to 2 million words (8M bytes) 
of central memory and 600 million words (2,4 x 10^ bytes) of disk storage. The system 
can run in both interactive and batch modes, and up to 63 users can be supported simul- 
taneously, The memory organization and the computational speed of the Prime 750 at the 
Langley Research Center are discussed in appendix A. 
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3*2 Floating Point Systems AP-120B Array Processor 


An array processor (AP) is a high-speed special-purpose computational device which 
can perform repetitive computations on well-structured data sets at effective speeds 
far beyond those achieved by current minicomputers. The AP-120B array processor^ also 
called an attached processor, has the following features (see ref. 10): 

(1) It is designed and accessed as a peripheral (i.e., like a tape drive) for a 

conventional host minicomputer, and it is intended to enhance the perfor- 
mance of the host in specific numerical computing tasks. 

(2) It achieves high performance through both parallelism and pipelining. 

(3) It includes an arithmetic section containing one adder and one multiplier 

capable of operating in parallel. 

(4) It is not hardwired. Rather, it can be programmed by the user in FORTRAN or 

assembler language to perform a variety of computational tasks. 

Parallelism is a major feature of the AP-120B that allows high-speed vector and 
matrix processing. In addition, the parallel structure of the AP-120B allows the 
overhead of loop indexing, array indexing, and data fetching to be performed in par- 
allel with floating-point computations. Parallel operations include integer index- 
ing, branch instructions, memory fetches, memory storage, I/O with the host, and 
floating-point arithmetic. These features make the AP-120B a much faster processor 
than most general-purpose computers which perform operations sequentially. The 
AP-120B has a peak processing rate of 1 2 million floating-point operations per second 
(12 MFLOPS) and a concurrent 6 million integer and addressing operations per second. 

The two major components of the AP-120B are the memory and the arithmetic unit. 

The organization of these two components is discussed in reference 1 1 and is s\amma- 
rized in appendix A. The memory includes main-data memory, table memory, program 
memory, data-pad registers, and address registers. (See fig. 3.) The function of 
each memory, including its size and word length, is given in appendix A. 

The arithmetic unit of the AP-120B has a two-stage pipeline adder and a three- 
stage pipeline multiplier. Each can perform up to 6 million floating-point opera- 
tions per second (6 MFLOPS). If both pipelines are kept constantly full, the speed 
can reach 12 MFLOPS, but it is often impossible to keep either or both pipelines 
full. Sustained speeds of up to 4 MFLOPS have been observed for some fluid-dynamics 
computations by W. T. Thompkins at the Massachusetts Institute of Technology. The 
interface between the host and the array processor is described in the succeeding 
subsection. 


3.3 Interface Between Host Computer and Array Processor 

There are two distinct operating systems which reside on the host: (1) the 

operating system of the host computer, and (2) a software driver for the array pro- 
cessor called **APEX” (array-processor executive) . The host operating system controls 
the operation of the host, including the user program which, in turn, specifies the 
sequence of events necessary to perform the computation. The data on which the com- 
putations are performed reside on a disk controlled by the host. Some of these data 
are present in the memory of the host, where they may undergo simple reorganization 
on their way to and from the array processor. 


5 



The host interface to the array processor consists of an interface board which 
is controlled by APEX, This software is resident on the host and facilitates com- 
munication with the array processor by translating FORTRAN calls into small descrip- 
tive packages called "function control blocks," These tasks are shipped to the array 
processor to initiate specific computations or data transfers on the latter device. 
The APEX is part of the host load module whose formation is depicted in figure 4, 
Additional details on the array-processor executive are given in appendix A, 


3,4 Comments on Operation and Performance of Minicomputer/Array- 

Processor System 

The following comments regarding the operation and performance of the 
minicomputer/array-processor system seem to be in order: 

(1) The high speed of the AP-120B can be attributed to the following three 

factors: (a) speed of the components (167-ns cycle time compared with 600-ns cycle 

time on the Prime 750); (b) parallel architecture with complete parallelism in the 
sense that all memory and arithmetic units can perform operations in parallel; and 
(c) pipeline processing of arithmetic operations, 

(2) The AP-120B performs vector operations with scalar hardware. Software loops 
constructed from scalar operations are used to program vector operations. By con- 
trast, most vector computers, such as the CRAY-1 S and the CYBER 205, provide one set 
of hardware to perform scalar operations and another set to perform vector computa- 
tions, The use of scalar hardware to perform both vector and scalar operations 
helped to keep the purchase cost of the AP-120B substantially below that of super- 
computers, The penalty for this architecture is the increase in software complexity 
for vector operations, A detailed discussion of this subject is given in refer- 
ences 6 and 10, Sample timings for vector-software operations on the AP-120B are 
given in table I, and the effect of vector length on the number of results per second 
is depicted in figure 5, 

(3) Parallelism on the AP-120B can be exploited only by the software. To syn- 
chronize and coordinate concurrent activities, the programmer or compiler must explic- 
itly code all parallelism into the program. This situation requires use of the 
lowest level of programming language (called "microcode”) and significantly compli- 
cates the software development, (See ref, 12,) 

Extensive libraries of microcoded routines (some callable from the FORTRAN pro- 
gram) are available for use on the AP-120B. These routines include an assembler 
(APAL), a loader and a linker (APLINK), a software simulator (APSIM), an on-line 
software debugger (ADBUG), an executive driver (APEX), a software chaining utility 
(VFC), and an extensive mathematics library including vector operations, matrix oper- 
ations, vector-to-scalar operations, and fast Fourier transform, (See ref, 7,) 


3,5 Simulator 

A software simulator (APSIM) has been developed by Floating Point Systems, Inc,, 
to assess the performance of the array processor through simulating the operation of 
the AP system software APEX and the execution of the assembler programs on the array 
processor. The simulator can be used on the host minicomputer for performing actual 
computation and for providing run-time estimates on the array processor without actu- 
ally using the array processor. The APSIM can run interactively, thereby providing 
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the user with maximum control over the program execution. Ttie simulation is limited 
to program execution on the AP; interaction with the host is not simulated. In the 
present study, however, the times expended in the communication between the Prime and 
the array processor are estimated to be only a small fraction of the total central- 
processing unit (CPU) time. Other difficulties observed in using the simulator are 
identified in appendix A. 


4 SUMMARY OF COMPUTATIONAL PROCEDURE AND NONLINEAR 
FINITE-ELEMENT EQUATIONS 

To simplify the presentation, discussion herein is limited to large-deflection 
static analysis of shells. The analytical formulation is based on a form of the 
geometrically nonlinear shallow-shell theory with the effects of transverse shear 
deformation included. A displacement formulation is used with the fundamental 
unknowns consisting of the three displacement components and the two rotations at 
each point of the middle surface of the shell structure. The shell is modeled by 
using 16-node quadrilateral elements with bicubic Lagrangian interpolation functions 
for each of the displacement and rotation components. 

The computational procedure used herein is based on the combined use of finite 
elements and the classical Rayleigh-Ritz approximation. The procedure has been 
referred to as the reduced-basis technique and is described in detail in refer- 
ences 13 and 14. The discretization of the shell is done by using finite elements. 
The nonlinear system of finite-element equations describing the response of the dis- 
cretized shell is then replaced by a reduced system of equations with considerably 
fewer unknowns through the application of the classical Rayleigh-Ritz technique. The 
reduced system of equations may be updated periodically as needed. 


4.1 Mathematical Formulation 

4.1.1 Governing finite-element equations .- A total Lagrangian formulation is 
used in describing the response of the shell, and the governing finite-element equa- 
tion in index notation can be cast in the following form: 


''iJ ’‘j * i ^IJK Xj \ ♦ T «IJKL ’■j \ - 0 <’ > 

where are the linear-stiffness coefficients; and are nonlinear- 

stiff ness coefficients; Xj are unknown nodal-displacement parameters; q^ are 
normalized external-load components; and q is a load parameter. The range of the 
indices I,J,K,L is 1 to N where N is the total number of displacement degrees 
of freedom, and a repeated index denotes summation over its full range. The 
coefficients ^IJK' ^IJKL completely symmetric with respect to permu- 

tation of indices. 

The loading is assumed to be conservative and proportional. As the load is 
increased, the value of the load parameter q changes, but the component Qj 
remains constant; that is, the spatial distribution of the load does not change. 

4.1.2 Basis reduction and reduced system of equations. - The nonlinear solution 
Xj is approximated, over a range of values of the parameter q, by a linear combi- 
nation of a small number of global-approximation (or basis) vectors. The basis vec- 


7 



tors consist of a nonlinear solution for a particular value of q and a number of 
its path derivatives (derivatives of Xj with respect to a path parameter \ which 
may be identified with a loading or displacement parameter) at the same value of q. 
The approximation can be expressed as 


J Jl^l 


(2) 


where J = 1 to N, i = 1 to r, r << N, (\>^ are undetermined coefficients, and 
is a transformation matrix whose columns represent global-approximation 
functions or basis vectors. Thus, 


Ji 





(3) 


A Rayleigh-Ritz technique is then used to approximate equation (1) by a much 
smaller system of nonlinear algebraic equations in the new unknowns 4;^. The reduced 
equations have the following form: 


K. . 
ID 




2 ^ijk ^ 


^ 3 


‘he - ‘J 2 . = 0 


(4) 


where 


K. . = K ^ r . r^. 

13 IJ II J 3 


F. ,, = F r . r . 

X]k UK Ii J] Kk 


"^ijkA ^IJKL ^Jj ^Kk 


Q. = Q r . 

^1 T Ii 


(5) 

( 6 ) 

(7) 

( 8 ) 


The subscripts ifj/k,X are indices ranging from 1 to r. The subscripts 
I,J,K,L are indices ranging from 1 to N. A repeated index in the same term denotes 
summation over its full range. 

The equations used in evaluating the basis vectors are obtained by successively 
differentiating the governing finite-element equation (eq. ( 1 )), and solving the 
resulting system of linear simultaneous algebraic equations. The explicit form of 
these recursion relations is given in reference 15 and is not reproduced here. 

The criterion for selecting the number of the basis vectors is based on 
monitoring the condition number of the Gram matrix p of the basis vectors and 
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on terminating the generation of the basis vectors when p exceeds a prescribed 
value, (See ref, 13,) Also, upper and lower limits for the number of basis vectors 
were chosen to be 10 and 2, respectively. 


4.2 Computational Procedure 

The five key elements of the computational procedure which strongly affect the 
performance of the reduced-basis technique are: (1) efficient evaluation of the 

basis vectors and generation of the reduced system of equations; (2) characterization 
of the nonlinear response of the shell by means of a single scalar; (3) automatic 
selection of load step size and evaluation of the corresponding displacements and 
stress resultants; (4) sensing and controlling the error in the reduced system of 
equations; and (5) tracing postbuckling and post-limit-point paths. These elements 
were discussed in detail in references 13 and 14, The reduced-basis technique appears 
to be suitable for solving large-scale nonlinear problems on the minicomputer/array- 
processor system because of the following: (1) the operations involving the large 

arrays of the full system of equations are limited to generating basis vectors and 
reduced arrays and to sensing the error of the reduced equations (see ref, 13); and 
(2) many operations required in the solution process can be performed as vector 
operations , 


4 . 3 Program Organization 

The nonlinear analysis program used in the present study uses the data-base 
management system of the SPAR finite-element system. (See ref, 16.) The program is 
divided into small self-contained logical sections or processors. Each processor can 
run as a separate program and complete its execution in an interactive mode. Com- 
munication between different processors is made through the common SPAR data base. 

The processor organization and the control and data flow in the program are dis- 
cussed in appendix B. 


5 VECTORIZATION OF ARITHMETIC OPERATIONS 

The array processor is specially designed for vector operations. The perfor- 
mance of a nonlinear finite-element program on the minicomputer/array processor system 
depends on the extent to which the pipeline and parallel-processing capabilities are 
used in the different modules of the system. Use of the scalar mode in all or most 
of the computation may provide moderate benefits (because of the small cycle time of 
the array processor and the complete parallelism in performing the operations). Full 
benefits of the array processor, however, are not obtained by use of the scalar mode. 
The process of designing a numerical algorithm to make effective use of the pipeline 
(or streaming) capability of the array processor is referred to as vectorization. 

The vectorization of any mathematical operation is dependent on the particular 
hardware. For a task to be vectorizable on the AP-120B, it should contain three 
characteristics: (1) repeated operations, (2) independence of each result from the 

others, and (3) the members of each operand are packed in either contiguous memory 
locations or at fixed intervals from each other (with some restrictions as seen in 
the discussion on interleave in appendix A) , 

In the present study, the following basic operations were vectorized: (1) eval- 
uation of linear- and nonlinear-stiffness arrays ^IJKL inde- 

pendent elements; (2) solution of linear algebraic equations using Cholesky's method; 


9 



(3) computation of the error norm and checking the convergence of the full-system 
solution in the Newton-Raphson iterative technique; (4) generation of basis vectors; 
and (5) computation of stress resultants and strain energy of the structure. These 
operations are listed in table II. Vectorization of each of the aforementioned oper- 
ations included the design of the data structure of the operands (arrays or vectors) 
as well as the sequence of the computation. For operands with simple -data structure 
(e.g., full matrices and vectors), the math-library routines (ref. 8) of the array 
processor were used to perform the vector operations. On the other hand, for operands 
with complex-data structure (e.g., sparse matrices and completely symmetric three- or 
four -dimensional arrays), assembler codes were developed to exploit the special-data 
structure of these operands, thereby improving the efficiency of the vector opera- 
tions. Details of the vectorization of the basic operations listed in table II are 
discussed in appendix C. Note that some of the vector operations listed in table II 
were overlapped with the host I/O operations, with a resulting improvement in 
efficiency. 


6 PROGRAMMING CONSIDERATIONS TO EXPLOIT MIN I COMPUTER/ ARRAY 
PROCESSOR CAPABILITIES 

After the proper computational procedure has been selected and the numerical 
algorithms have been vectorized, the realization of the full potential of the 
minicomputer/array-processor system requires the following: (1) the minimization of 

the I/O operations in order to make the program central-processing (CP) bound, and 
(2) organization of the computation and data in order to achieve maximum parallelism 
between the host and AP operations. The programming efforts to accomplish these 
tasks are outlined in this section. The major factors affecting the I/O operations 
are the host Prime 750 computer, the AP-120B, and the communication interface between 
the host and the AP as listed in table III. In the discussion of these factors, 
hardware, system software, and user software are considered with emphasis on the 
user-software techniques to minimize the I/O operations. 


6.1 Minimization of I/O Operations 

6.1.1 Minimizing the host system I/O .- The major hardware factors that affect 
I/O operations on the host include cache -memory size and speed, central -memory size 
and speed, and disk-access speed and transmission rate. The system-software factors 
affecting I/O include resource management and allocation and the paging algorithms 
between cache and central memory, as well as between central memory and disk. 

On the user-software level, minimization of I/O can be achieved through the 
proper use of buffering techniques. Six buffering techniques were used in the 
present study. Three of these techniques were proposed by Mitch Modeleski of Systems 
Applications, Inc., for use in an atmospheric-simulation model. They were found to 
be effective on the Prime 650 computer. The other three techniques were developed 
by the first author of the present paper. In order to assess the effectiveness 
of the different techniques, a small FORTRAN program was developed to transfer 
10 000 vectors of nodal displacements from central memory to disk. The CPU and I/O 
times for execution of this task were measured by using subroutines SNAP and SHOT. 

The details of the six buffering techniques and their performance are given in 
appendix D. 
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Minimizing the I/O within the array processor ,- The hardware and system- 
software features that affect the I/O within the AP are listed in table III. On the 
user-program level, I/O operations within the AP can be minimized by minimizing the 
memory-fetch operations and data-path conflicts through proper selection of the 
buffering technique used between table memory and main-data memory. The details of 
these techniques are given in appendix D. 

6.1 .3 Minimizing I/O between host computer and array processor .- In a multiuser 
operating-system environment, the host overhead can be a significant portion of the 
total time needed to run a problem on the minicomputer/array-processor system. On 
the user-software level, merging data, chaining AP routines, and proper sequencing 
of AP calls can be used to reduce host-system overhead. Details are described in 
appendix D. 


6.2 Parallel Host and Array-Processor Operations 

The parallel architecture of the AP-120B allows overlapping of data transfer 
from the host computer with the execution of arithmetic operations on the AP. After 
data have been transferred, the host can perform data management and other tasks 
during the execution of the operations on the AP. Specifically, for the nonlinear 
finite-element program considered herein, the host can perform the following three 
data-management tasks while the AP is executing the arithmetic operations: 

(1) Data storage and retrieval from disk 

(2) Data reorganization required for host and AP programs 

( 3) Synchronization of data transmission to and from the AP 

Overlapping the data transfer and execution of the AP requires bypassing the 
task-synchronization commands of the AP, which result in sequential operations, and 
designing the data structure in the program carefully, so that data sets are avail- 
able whenever they are needed. Examples of the task-synchronization commands are 
APWR (wait on running), APWD (wait on data), and APWAIT (wait on data and running). 


7 TEST PROBLEMS USED IN EVALUATION OF PROPOSED ALGORITHMS 

To assess the performance of the vectorized algorithms and the programming 
techniques outlined in the previous sections, numerical studies were made on two 
problems of elastic collapse of cylindrical-shell structures. The problems were run 
on the Prime 750 minicomputer and subsequently on the minicomputer with an AP 
simulator. Results were compared to determine the potential gain in speed obtained 
by use of the AP in conjunction with the techniques discussed in the present paper. 

The first test problem has a pear-shaped cross section (fig. 6), and the second 
structure has a circular cross section and a rectangular cutout (fig. 7). The two 
problems were used in reference 17 to assess the capability of programs to analyze 
shell structures. The load was applied to the cylinders by means of a uniform axial 
shortening which is increased incrementally until the cylinder collapses. Advantage 
was taken of the symmetry, and each structure was modeled by using 16-node, quadri- 
lateral, shear-flexible, shallow-shell elements with bicubic Lagrangian interpolation 
functions (a total of 80 generalized -displacement degrees of freedom per element) . 

The total degrees of freedom for the first structure is 1715, and for the second it 
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is 3230* The finite-element models used for the two structures are shown in fig- 
ures 6 and 7. Four distinct elements are used in modeling the first shell, and 
52 elements are used in modeling the second shell. In both problems, the basis 
vectors were generated at zero loading and, therefore, their evaluation involved the 
decomposition of the linear -global-stiff ness matrix [K] • Five basis vectors were 
used in the first problem, and six were used in the second. 

The CP and I/O times were recorded which were required for evaluating the basis 
vectors (at zero loading), generating the reduced equations and marching three steps 
with these equations. Figures 8 and 9 show the CP times expended in 10 different 
processors for the 2 cylindrical-shell problems. The 10 processors are as follows: 
(1) evaluation of linear-stiffness arrays ; (2) evaluation of nonlinear- 
stiffness arrays and ; (3) assembly of elemental matrices; (4) incor- 

poration of boundary conditions; (5) decomposition of the global-linear-stiffness 
matrix [K]; (6) forward reduction/back substitution (forward/back solve); (7) solu- 

tion update; (8) postprocessing (stress resultant and strain-energy computation); 

(9) generation of basis vectors; and (10) solution of reduced equations. Three of 
these processors were not vectorized, namely, the assembly, boundary conditions, and 
solution of reduced equations. Also, since the basis vectors were generated at zero 
loading, no convergence check was needed for the full system. The percentage of time 
expended in each of the processors and the estimated gain in speed obtained by using 
the array processor are summarized in table IV, 

As can be seen from table IV and figures 8 and 9, most of the solution time is 
spent in the processor used for the decomposition of the left-hand side and in the 
processor used for the evaluation of basis vectors and the generation of reduced 
equations. The use of the array processor is shown by the simulator to reduce the 
total times (CP + I/O) expended in these two processors by factors of 46,7 and 17,1 
for the pear-shaped cylinder and by factors of 59,8 and 38,1 for the cylinder with a 
cutout. Estimates for total gains in speed for both sample problems are presented in 
figure 10, Although the total gains in the CP speed obtained by using the AP for the 
two problems are 9,2 and 18,1, respectively, the gains in the total (CP + I/O) times 
are reduced to 5,2 and 9,9, only. Note that the higher gain in speed for the larger 
problem was due to the higher percentage of time expended in the two processors used 
for the decomposition of the global-stiffness matrix [K] and generation of basis 
vectors. Both processors were effectively vectorized. 


8 FURTHER IMPROVEMENTS TO MINICOMPUTER/ARRAY-PROCESSOR SYSTEMS 

Since the development of the AP-120B, several new advances in AP hardware have 
taken place. Two of the advanced AP systems are described herein along with improve- 
ments in numerical algorithms and software design that are likely to enhance the 
effectiveness of the minicomputer/array-processor system for large-scale, nonlinear, 
finite-element analysis. 


8,1 New Hardware 

Among the new array processors with improved capacity and performance over those 
of the AP-120B, mention may be made of the FPS-164 (ref, 18) and the CSPI MAP-6400 
(ref, 19), 
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The overall architecture of the first processor, FPS-164, is very similar to the 
AP-120B, The cycle time and machine timings are identical. However, the following 
major improvements can be identified; 

(1) 64~bit floating-point arithmetic (compared with 38 bit on the AP-120B) 

(2) 32-bit integer arithmetic (compared with 16 bit on the AP-120B) 

(3) 24-bit addressing (compared with 20 bit addressing on the AP-120B) 

(4) 1024 64-bit word-instruction cache-memory loading from main memory, which 

replaces the program memory of the AP-120B and allows larger programs to 
be accommodated in a more flexible way 

(5) A clock for timing programs (which is lacking on the AP-120B) 

(6) A 256-register subroutine call stack and 64 address registers 

(7) Main memory expandable to 7.25 million 64-bit words (58M bytes) 

A recent study has shown reductions in wall-clock time obtained by using the 
FPS-164 for a number of benchmark structural problems (900 to 3900 degrees of 
freedom) by factors of 7.5 to 10.6 over those required by the DEC VAX-11/780 
supermini. The general-purpose computer program ANSYS^ (ref. 20) was used in these 
studies . 

The second group of array processors, the CSPI MAP series, have two major 
features which distinguish them from the FPS array processors. The first is the 
asynchronous multiple-bus architecture in contrast to the FPS synchronous archi- 
tecture. The multiple asynchronous I/O buses on the CSPI MAP array processors 
significantly reduce data-path conflicts and provide fast concurrent I/O, thereby 
allowing the arithmetic units to be utilized more effectively for vector operations. 
However, difficulties can arise in coordinating several asynchronous processors for 
scalar operations. Processing speeds up to 4.3 MFLOPS have been achieved on the 
CSPI MAP-300 series. Furthermore, the asynchronous hardware architecture differs 
from the synchronous in that it allows individual components of the AP to be upgraded 
without redesigning the entire machine. The second major feature is the use of the 
shared-memory concept to eliminate the I/O overhead between the host and the array 
processor. This concept was implemented on the Gould Corporation's S.E.L. 32/77 
superminicomputer. The CSPI MAP array processors were coupled with the S.E.L. 32/77 
superminis through a common-memory interface to form VPS- 3 300 and VPS-6400 vector- 
processing systems. The former has a 32-bit word size, and the latter has a 64-bit 
word size. 

The CSPI MAP series of peripheral array processors offer considerable 
versatility when connected with different host computers. Only a single board 
replacement is necessary to interface with new host computers. The use of a CSPI 
array processor for finite -element analysis is discussed in references 2 and 3. 


^ANSYS ; 


Registered trademark of Swanson Analysis Systems, Inc. 
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8.2 Improved Algorithms and Programming Language 


To improve the effectiveness of the minicomputer/array-processor system for 
large-scale finite-element computations, improvements are needed in six areas: 

(1) Parallel host and array-processor execution: A greater degree of parallel 

processing than that used in the present study can be achieved by careful design of 
the software. 

(2) Parallel host and multiple array-processor execution: Many peripheral array 

processors can be added to the host computer. Additional levels of parallel- 
processing capability require effective algorithms on the host to control the 
parallel execution of operations on the host and on the peripheral devices. 

(3) Vectorization of packed-data structures; Parallel processing within the 
array processor of packed-data structures can be accomplished by using table memory 
to store constant offsets into the packed-data structure. 

(4) Computational algorithms for parallel processing: The most time-consuming 

algorithms in nonlinear finite-element analysis appear to be very adaptable to 
parallel processing. One criterion used in the selection of the computational 
algorithms should be to maximize parallel processing. 

(5) Expert systems on host computer: Expert systems (ref. 21) are programs 

written to perform the decision-making processes of human experts. They are a result 
of research in an area of computer science called artificial intelligence. Among the 
possible applications for this capability are the following: (a) determination of 

the optimal number of nodes per hypermatrix block to be used in the Cholesky 
decomposition and solution of the full system; and (b) allocation of central-memory 
space for user I/O buffers. 

(6) Host-computer programming language: The programming language of the 

Prime 750 host is FORTRAN, which is not designed for parallel processing. A 
parallel-processing language, such as ADA (ref. 22) or concurrent PASCAL (ref. 23), 
would be helpful for program development on a parallel-processing system. 


8.3 User Interaction 

Increased user interaction with the minicomputer /array-processor system can be 
achieved through the use of interactive graphics and relational data-base systems. 
(See ref. 24.) The latter is particularly useful to engineering users because of the 
natural form of representation of data, namely, two-dimensional tables or relations. 

A relational information manager (RIM) has recently been implemented on the Prime 750 
and the DEC VAX- 11/780 computers. (See ref. 24.) 


9 CONCLUDING REMARKS 

A study is made of the potential of using a minicomputer/array-processor system 
for large-scale, nonlinear, finite-element analysis. A Prime 750 superminicomputer 
is used as the host computer, and a software simulator residing on the Prime is 
employed to assess the performance of the Floating Point Systems AP-1 20B array 
processor. The major hardware characteristics of the system that significantly 
impact nonlinear finite-element computations are identified. Computational 
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strategies and programming techniques are developed to exploit these character- 
istics, Two benchmark nonlinear problems with 1715 and 3230 degrees of freedom are 
selected to measure the anticipated gain in speed obtained by using the algorithms 
developed on the array processor. The estimated gains in central-processing-unit 
(CPU) speed for the two problems were 9,2 and 18,1; however, the reductions in the 
total solution times, central processing and input-output (CP + I/O), were only 
5,2 and 9,9, 

The following major hardware characteristics of the Prime 750/AP-120B system 
have been identified as having a significant impact on nonlinear finite-element 
computations : 

(1 ) Virtual memory of the host Prime 750 supermini consisting of the three 
levels: cache memory, central memory, and disk, with a total storage capacity of 
over 600 million 32-bit words, 

(2) Pipeline processing through the multiple pipelined floating-point arithmetic 
elements (adder and multiplier) of the array processor, 

(3) Parallel processing achieved on three levels: (a) on the array-processor 

level, all memory and arithmetic units can perform operations in parallel; (b) on the 
host level through parallel CP and I/O operations; and (c) between the host and array 
processor (AP) , 

(4) Distributed processing: The computationally intensive portions of the 

program are allocated to the AP; and the other tasks, such as data management, 
control of the AP, and user interface, are allocated to the host supermini. 

To exploit the aforementioned hardware features of the minicomputer/array- 
processor system in a large-scale, nonlinear, finite-element analysis, the follow- 
ing tasks were performed: 

(1 ) Proper selection of the con^utational algorithm: Although a comprehensive 

study of the different computational algorithms was not made, the reduced-basis 
technique appears to be very suitable for use on minicomputer/array-processor 
systems, 

(2) Minimization of I/O operations through the following: (a) the use of 

buffering techniques on the host computer; (b) treatment of the table memory and 
main-data memory of the array processor as a two-level virtual memory; (c) merging 
smaller arrays into larger ones; and (d) chaining several array-processor routines 
into a single routine in order to reduce the number of subroutine calls, 

(3) Vectorization of numerical algorithms: Although most of the computation- 

ally intensive processors of the program have been vectorized in the present study, 
the efficiency of vector operations on arrays with a packed-data structure needs 
improvement, 

(4) Synchronization of parallel operations on the host and array processor. 
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On the basis of the present study, the following two conclusions seem to be 
justified; 

(1) The use of the aforementioned computational strategies and programming tech- 
niques developed in the present study make large-scale, nonlinear, finite-element 
computation on the minicomputer/array-processor system feasible. Ihe computational 
speeds achieved are comparable to those of the large mainframes. 

(2) The peak-performance rates of the AP-120B at 1 2 million floating-point 
operations per second (12 MFLOPS) cannot be achieved in practical finite-element 
computation. Computational speeds of 2 to 3 MFLOPS are possible for some of the 
vector operations. Further improvement in speed requires the development of low- 
level microcodes to take advantage of the parallelism in the system. This task can 
be fairly complicated. 


Langley Research Center 

National Aeronautics and Space Administration 
Hampton, VA 23665 
January 20, 1983 
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10 APPENDIX A 


DISCUSSION OF MAJOR-HARDWARE AND SYSTEM- SOFTWARE FEATURES OF 
MINICOMPUTER/ARRAY-PROCESSOR SYSTEM 

The memory organization and computational speed of the host Prime 750 supermini- 
computer are discussed in this appendix along with the memory organization, 
arithmetic units, executive software, and the simulator of the AP-120B array 
processor. 


10.1 Prime 750 

10.1 .1 Memory organization. - A basic feature of the Prime 750 minicomputer 
at the Langley Research Center is its three-level memory organization depicted in 
figure 2. The first level of storage is the cache memory with a capacity of 4000 
words (16K bytes) where each word is 32 bits (approximately 8 significant digits). 

The second level is the central memory whose capacity is 0.5 million words 

(2M bytes), and the lowest level of storage is a moving-head-disk system with a total 
capacity of 75 million words (300M bytes) • The central processing unit can reference 
only data and code stored in the cache memory. Cache memory provides the CPU with 
high-speed access to central memory. It contains information that is a duplicate to 
that of central memory. The Prime 750 has an advertised cache-hit rate of 95 per- 
cent, meaning that 95 percent of the memory accesses requested by the CPU are found 
in cache memory. This reduces central-memory access time from a nominal of 600 ns to 
an effective 110 ns (cache-memory access time is 80 ns), thus allowing faster memory 
access for critical-performance portions of the code. 

Transfer rates between the three memories are drastically different. The trans- 
fer rate from disk to central memory is 2 million words per second, not including 
setup time for each access. The setup time, which is the time needed to service the 
request and find the data on the disk, averages 600 ms. A maximum of one segment 
(32 000 words) can be transferred per disk call (8 ms). The average disk setup and 
transfer times range from 600 to 608 ms. Central-memory access times are 64.6 times 
faster than disk access time for an average segment transfer. Access times on cache 
memory are 7.4 times faster than those on central memory. 

The memory system of the Prime 750 can be viewed from two different levels: 

10.1.1.1 User level: The user views central memory as being divided into seg- 

ments of 32 000 words each. The user program allocates central-memory space for the 
storage of data used by the program and can create user buffers for transferring data 
from central memory to disk. Special precautions have to be taken when user buffers 
cross segment boundaries . 

10.1.1.2 System level; The system views and manages memory from a higher level 
than the user. If the user program and data cannot fit into central memory, the 
operating system decides which pages (groups of 512 words) will reside in central 
memory and which pages are transferred to disk. When the program references its 
address space for data located on disk, a page fault occurs and the desired page of 
data is brought into central memory while another page is moved out. This automatic 
overlay system is referred to as virtual memory. 
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The system also manages the use of cache memory through a microcoded program* 
When the CPU requests data from cache and the data are not there, the data are copied 
from central memory (which obtains pages from disk) in two-word blocks. The second 
word is automatically copied into cache to reduce the number of central-memory 
accesses in a prefetching procedure. The CPU can now access the data item. This 
creates an essentially two-level paging system, (See, for example, ref, 25,) The 
system memory-management procedures are invisible to the user. The system also 
creates I/O buffers that are invisible to the user. The size of some of these system 
I/O buffers can be controlled by the user as will be illustrated in a later section. 

Note that if the Prime 750 system routines are used for I/O operations, then the 
data must be contained within one segment or a condition of "wrap around" occurs. As 
an example of "wrap around," consider an array {X} which has the dimension 10, If 
the first 5 elements of {x} are located at the end of 1 segment and the remaining 
5 elements of {x} are located at the beginning of an adjacent segment, then a user 
request for the 10 items {x} from disk results in reading the first 5 items cor- 
rectly, However, the second 5 are read from the data and/or the code at the begin- 
ning of the first segment instead of being read from the beginning of the second 
segment, that is, locations 6 to 10 of {x}. This condition necessitates the careful 
outlay of a program's central memory and I/O operations to the disk. As a result, 

32 000 words (1 segment) is the maximum size of a user I/O buffer. This maximum is 
allowed only when the buffer starts at the beginning of a segment. Note that the 
wrap-around problem occurs only with disk I/O operations using Prime system 
routines. User buffers larger than one segment can be used, provided the Prime 
system I/O routine is not used. 

10,1.2 Computational speed .- In table AI the computational speed of the 
Prime 750 is compared with that of other minicomputers and mainframes for a niamber 
of benchmark structure calculations. (See ref, 26.) The test was performed in a 
single-user environment. Table AI illustrates the performance measured in terms 
of CPU times available in the minicomputer cost range. The addition of an array 
processor to the Prime 750 system achieves the mainframe performance (i.e., CPU 
speeds) while remaining well below the mainframe cost range. 

Timings of the nonlinear finite-element code described in subsequent sections 
were measured by using Prime system routine "TIMDAT," In order to verify the manu- 
facturer's time estimates (ref. 4), a small test program was developed and used to 
measure the speed of various floating-point operations. Table All shows that the 
measured timings fall within the manufacturer's range for each floating-point opera- 
tion, The range of speed shown in the table occurs because in a multiuser environ- 
ment the user has no control over whether cache memory or central memory is used for 
the double floating-point load operation. Therefore, all timings presented were 
measured in a single-user environment for consistency. 


10.2 AP-120B Array Processor 

The overall architecture of the Floating Point Systems AP-120B array processor 
is based on multiple special-purpose memories feeding two floating-point pipelined 
arithmetic units via multiple data paths. The details of the memory organization and 
arithmetic units are discussed subsequently, 

10.2.1 Memory organization .- Memory of the AP-120B includes main-data memory, 
table memory, program memory, data -pad registers, and address registers. (See 
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fig, 3,) A description of the function of each memory, including its size and word 
length, is given as follows: 

10*2*1.1 Main-data memory: Main-data memory is the main storage file for the 

array processor and serves as the main interface (buffers) for communication with the 
host computer. It is composed of memory banks of 4000 or 16 000 38-bit words and is 
expandable to 320 000 38-bit floating-point words. Main-data memory is available in 
two speed ranges: 333 ns (slow) and 167 ns (fast). These times indicate the fre- 

quency with which a program can reference main-data memory. All AP-120B timings in 
this paper assume the use of fast main-data memory. 

10.2.1.2 Table memory: Table memory is used to store floating-point constants 

and slowly changing data. This memory is available in ROM (read only memory) or RAM 
(random access memory) with a maximum size of 64 000 38-bit words. 

10.2.1.3 Program memory: Program memory stores AP-120B programs in a maximum of 

4000 64-bit words. 

10.2.1.4 Data pads: Data-pad registers are referred to as X and Y. Each can 

store 32 intermediate floating-point numbers. Two sets of registers are needed since 
the floating-point multiplier and adder require different operands with parallel- 
access capability . 

10.2.1.5 Address registers: The 16 address registers contain integer data items 

of size-16 bits. The inability of the AP-120B to achieve the maximum of 12 MFLOPS in 
most floating-point arithmetic operations is due to insufficient parallel data flow 
to keep the floating-point multiplier and adder fully occupied. Figure A1 details 
the data flow and interconnection network of the AP-120B. 

The user has complete control over the flow of data in the AP memories. This is 
a complex task when timing characteristics of the different memories are considered. 
The following discussion will first address a timing characteristic called "inter- 
leave” which is peculiar to the main-data memory, and then the timing characteristics 
of other memories within the AP will be discussed. 

Interleave is the sequential reference to different banks of main-data memory. 
Main-data memory is divided into banks to facilitate faster access than standard 
main-data memory, which takes three instruction cycles (500 ns). Each bank holds odd 
or even memory locations. Memory references to main-data memory are now allowed every 
333 ns for slow memory and 167 ns for fast memory as long as interleave is not 
broken. If interleave is broken, the AP executes a "spin" operation which halts all 
processing until the memory-reference instruction can be satisfied. The AP assembler 
programmer must be careful that his data are accessed sequentially and that main- 
data-memory accesses have at least one instruction cycle between them for slow main- 
data memory. The instruction cycle on the array-processor hardware is 167 ns. All 
separate memories and arithmetic units in the AP are synchronized to the instruction 
cycle. For example, the contents of a main-data-memory location become available to 
the program three instruction cycles after the main-data-memory address register has 
been altered. Reference instructions to main-data memory are allowed only every 
333 or 167 ns with the assumption that interleave is not broken. Data from table 
memory are available two instruction cycles after the table-memory access register is 
changed. Data from address registers and data-pad registers are available during the 
instruction cycle in which they are referenced. Data-pad X and data-pad Y cannot 
be referenced in the same instruction cycle. Further timing complications occur when 
considering the pipeline architecture of the floating-point adder and multiplier. 
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10, 2*2 Arithmetic units ’Ihe floating-point arithmetic units in the AP-120B 
have pipeline architecture which segments arithmetic computation into a sequence of 
basic operations (such as exponent comparison, coefficient alignment, addition, and 
normalize shift). The arithmetic unit can perform basic operations simultaneously 
on independent pairs of data elements, with each pair at a different stage in the 
computation. 

There are two types of data elements on the array processor, scalar and vector, 

A scalar is a single-valued element of data usually contained in one storage word, A 
vector is an array of scalars usually contained in either contiguous locations of 
memory or at fixed intervals from each other. If the same operation is performed on 
successive data pairs (vector operation), results are available as frequently as 167 ns 
for some operations. Such operations are called vector operations and can lead to 
substantial gains in speed over scalar arithmetic operations where one operation is 
performed at a time. 

Floating Point Systems, Inc,, designed the arithmetic units of the AP-120B for a 
balance of vector and scalar capabilities, (See ref. 11,) This results in a vector 
processor that performs well on short vectors and sequential operations. Imple- 
mentation of this idea involved adjusting the cycle time of the AP to shorten the 
arithmetic-unit pipelines to only two or three stages. Other vector processors, for 
example, the CDC STAR-100, have much longer pipelines and perform poorly on scalar 
operations. 

Each vector operation on the array processor has associated with it a delay 
before the first result emerges from the pipeline. This time is called start-up time 
and is independent of the vector length. The start-up times on the AP-120B range 
from 333 to 500 ns for various floating-point operations. 

The AP-120B has a two-stage pipeline adder and a three-stage pipeline multi- 
plier, (See fig, A2.) Each can perform up to 6 million floating-point operations 
per second (6 MPLOPS) . 


10,3 AP Executive Software (APEX) 

A table is maintained by APEX that describes the contents of the program-source 
memory in the AP, When the AP subroutine is called from the host program, APEX 
checks the subroutine table to see if the called routine is present in program-source 
memory. If the routine is not in memory and space is available in program-source 
memory, the program instructions are loaded normally. The routine's name and offset 
are added to APEX’s table. However, if space is not available, APEX overwrites 
routines in program-source memory on a last-in, first-out basis, APEX loads 
parameters from the array-processor subroutine call statement into address registers 
and initiates execution of the program by the AP-120B, The time that elapses between 
the array-processor subroutine call in the FORTRAN program and the execution of the 
instructions in the AP-1 20B is "host overhead” for that call. This overhead is 
typically from 100 to 1000 ps per array-processor subroutine call, (vSee ref, 8.) 

Host overhead becomes very significant with a large number of calls to the array 
processor. 

An I/O processor and programmable I/O processor are provided to allow direct- 
data transfer to a peripheral storage device (disk) by the array processor. Depen- 
dence on the host interface is thereby greatly reduced. 
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On the array-processor side the user can code operating-system functions to 
facilitate communication with the outside world, of which the host computer is a 
part. This is accomplished through direct communication with both the host operating 
system and APEX. 

Communication and data-transf er operations between the AP and host computer are 
processed by a combination of hardware and software. There are three basic stages to 
consider: 

(1) Before the array processor is used, its presence has to be acknowledged by 
the host computer through an initialization procedure. 

(2) APEX is loaded as part of the host load module. (See fig. 4.) 

(3) APEX is activated, thus allowing the array processor to run as an 
independent computer. 

There are three ways to control the attached processor: 

(1) Direct control: Function control blocks are generated, transferred, and 

executed separately. 

(2) Function lists: A group of function control blocks are generated and 

transferred to the AP. These function lists can be executed many times without 
transfers or regeneration. 

(3) User written-assembler routines: The most effective and most time consuming 

routine involves writing the program directly for the AP, thereby making the best use 
of its characteristics. This is required for algorithms involving sparse matrices 
and operations with overpacked data structures. 


10.4 Comments on Use of AP-120B Simulator 

Although the AP-120B simulator (APSIM) was very useful in assessing the 
performance of the array processor AP-120B, the following four difficulties were 
observed in using it: 

(1) The simulation is limited to progrcun execution on the cirray processor, and 
the interaction with the host computer is not simulated. 

(2) The size of the main-data memory on the simulator is considerably smaller 
than that of the AP-120B (8192 words as compared with a maximum of 1 -million words on 
the AP-120B). Therefore, smaller-size arrays and vectors are used. 

(3) The single command that allows AP subroutines to be called from the AP 
assembler program (JSR command) is not operational on the simulator. This 
necessitated the use of two instructions (JUMP and BRANCH) to perform the same 
operation. 
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(4) The computational speed of the simulator is dependent on the speed of the 
host computer and is considerably slower than that of the array processor. There- 
fore, for effective use of the simulator in the present study, small sample problems 
were used to debug the code developed, and then the larger problems were run in a 
batch mode. Computation times were recorded to assess the gain in speed obtained by 
using the array processor. 
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TABLE AI.- CPU TIMES FOR BENCHMARK PROBLEMS OF 
STRUCTURES LISTED IN REFERENCE 26 


Mainframe 
(cost range, 
$2M to $6M) 

Time, 

sec 


Minicomputer 
(cost range, 
$50K to $300K) 

Time, 

sec 

CDC CYBER 175 

2 


DEC VAX-1 1/780 

23 

IBM 360/95 

3 


Prime 750 

32 

IBM 370/168 

4 


DEC PDP- 11/70 

42 

CDC 6600 

8 


Prime 500 

47 

IBM 360/75 

10 


S.E.L. 32/75 

52 

CDC CYBER 173 

12 


Prime 400 

65 

UNIVAC 1108 

15 


SIGMA V 

71 




S.E.L. 32/55 

72 


] 


MODCOMP IV 1 

85 


TABLE All.- ESTIMATED AND MEASURED TIMING FOR DOUBLE-PRECISION 
(64-BIT) FLOATING-POINT OPERATIONS ON THE 
PRIME 750 COMPUTER 


Operation 

Time, |is 

Estimated (ref. 4) 

Measured 

Load 

0.47 


Store 

®0.93 to 2.61 


Add + Load + Store 

®3.17 to 4.85 

3.45 

Multiply + Load + Store 

^6.14 to 7.82 

6.42 

Divide + Load + Store 

^9.39 to 1 1 .07 

9.89 

Loop overhead 

^3.02 to 3.14 



^Range shown is due to fast cache memory and slower central 
memory . 


23 













APPENDIX A 


First stage Second stage Third stage 




Figure A2.- The AP-120B arithmetic pipelines. 
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PROGRAM AND PROCESSOR ORGANIZATION 

The program organization, control flow, and data flow for the computational 
procedure discussed in sections 4.1 and 4.2 are discussed in this appendix. The 
program is divided into self-contained units or processors whose organization is also 
discussed herein. 


11.1 Program Organization 

A computer program based on the reduced-basis technique was implemented on the 
CDC CYBER 175 computer under the NOS operating system at the Langley Research Center 
in a mostly batch-oriented environment. In the present study, the program was 
modified to make it more suitable for the interactive environment of the Prime 750 
and to exploit the hardware characteristics of the minicomputer/array-processor 
system. The major features of the modified program are as follows: 

(1) The program is divided into small self-contained logical sections or 
processors. Each processor can run as a separate program and complete its execution 
in an interactive time frame, thereby increasing the user control over the flow of 
computation. 

(2) Communication between different processors is made through a common data 
base. The data-base complex utility (DCU) of the finite-element program system SPAR 
(see ref. 16) was selected for performing the data-manipulation and data-management 
tasks in the modified program. The DCU has been designed to minimize the disk I/O 
operations on the Prime computer. 

(3) Most of the arithmetic operations were vectorized to exploit the pipeline- 
processing capability of the array processor, and an attempt was made to overlap the 
host and array-processor operations. The details of the vectorization of the 
numerical algorithms, programming techniques used for overlapping the host and array- 
processor operations, and minimizations of I/O operations are described in sections 5 
and 6. A flow chart of the modified program is shown in figure B1 . For the sake of 
clarity, a distinction is made between the full- and reduced-system processors. The 
former use full-system (global) arrays and vectors, and the latter use reduced-system 
arrays and vectors. The control flow, data flow, and processor organization in the 
modified program are discussed subsequently. 


11.2 Control Flow 

The control flow of the nonlinear finite-element program depicted in figure B1 
represents the normal sequence of program execution. In addition to the normal 
sequence, the program design has the flexibility to allow interaction by display 
processors, alteration of executing data sets or program parameters, or elimination 
of the execution of a processor whenever needed. The following comments concerning 
the control flow described by figure B1 are in order: 

(1) The individual blocks in the group of boxes marked "solve nonlinear reduced 
equations" are not numbered because they constitute a portion of the 
reduced-system solve processor. 
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(2) The reduced arrays referenced in box 6 are K. , , -vo • 

(3) Box 9 , which is designated "recover full solution,” is considered as part of 

the reduced-system processor since reduced-system data are used to form 
the full-system vector {x}* 

(4) The interface between the full and reduced systems is controlled by the 

decision box between the processors in boxes 17 and 18* This interface is 
important because full-system analysis is considerably more expensive than 
reduced-system solutions* 


11*3 Data Flow 

During execution of a processor a data-base file is created which is called a 
library. Each library is identified by a table of contents whose entries contain the 
following information (see refs. 27 and 28): 

(1) A sequence number to determine the order of data-set creation 

(2) A number indicating the relative location on disk where the data start 

(3) The date and time of data-set creation 

(4) An integer-error code 

(5) The number of words in the data set 

(6) The number of columns and number of columns times rows in the data set 

(7) An integer-data-type code 

(8) A four -part data-set name 

The information in the table of contents completely describes the data being stored 
and allows the user to access these data, print it, copy it to another library or 
data set, write it to tape, or retrieve it from tape. The data sets can also be 
assigned symbolic names and be used in equations involving matrix and vector 
computations • 


11.4 Processor Organization 

The routines represented in the flow chart in figure B1 were grouped into 
separate processors or functional units and were connected through the common data- 
base DCU. For example, F and G arrays are formed by a single processor (fig. B1 , 
box 1) and are stored on the common data base. These data are retrieved from the 
data base and are used to generate the basis vectors (fig. B1 , box 5) in the 
processor called "generate basis vectors." The individual processors can be run in 
any sequence specified by the user, provided the data needed at any stage are 
available in the data base. 
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The following advantages for dividing the program into smaller processors 
connected through a common data base can be identified; 

(1) Minimal central-memory requirements: Only COMMON BLOCKS and subroutines 

needed for the specific calculation are included in the processor, 

(2) User interaction and convenience: This includes interactive program 

development and debugging^ easy numerical and/or graphical display of data, and easy 
manipulation and modification of data between processor executions* 

(3) Faster run time: The wall-clock time can be reduced if the program is 

divided into processors. This is because the scheduling algorithms for most multi- 
user operating systems give higher priority to jobs that require small system 
resources, and the individual processors generally require smaller central-memory 
space during execution than that required by the entire program when it is run in a 
batch mode * 

(4) Reduction of program-development time: Once a processor has been debugged, 

its data have been generated and sent to the data base* That processor need not be 
loaded or executed in subsequent debugging runs* Also, having checked the 
correctness of data on the data base, the debugging is confined to the processor 
being developed* 

Adaptation of existing or new programs to the finite-element system is greatly 
simplified. Programs (e.g., graphical, data base, or computational) are easily 
connected to the data-base system through the SPAR FORTRAN interface* This is a 
significant saving in program conversion and development time. 

(5) I/O optimization; The data-base system used in conjunction with the 
processor organization centralizes the places where I/O to the disk occurs. This 
allows easy optimization of I/O for the program* The Prime system routines were used 
in the SPAR program to minimize I/O operations* 
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FULL SYSTEM REDUCED SYSTEM 

Figure B1 •- Program organization, The paired blocks (3,12), (4,10) 
and (5,15) use the same subroutines. 
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DETAILS OF VECTORIZATION OF BASIC OPERATIONS 

Some details of the vectorization of the basic operations listed in table II 
are discussed in this appendix. However, extensive work was done to improve the 
efficiency of the vector operations which is not detailed here. The data structures 
selected for the different operands are denoted by the symbols A, B, C, D, and E 
and are identified in table Cl, A list of the array-processor routines used in the 
present study is given in table Cl I, Some of the vector operations were performed by 
using the math-library routines of the array processor; others required the develop- 
ment of special routines in assembler language. 


12,1 Evaluation of Linear- and Nonlinear-Stiffness Arrays 

12,1,1 Linear-element-s tiff ness matrix ,- A linear-element-stiffness matrix 
] can be conveniently expressed in the following form (see ref, 29); 


1=1 


(Cl ) 


^1 = 


(C2) 


where [k ^ ] is an 80 x 80 matrix (16-node quadrilateral element with five degrees 
of freedom per node) , is a weighting coefficient, n is the member of quad- 

rature points, [B] is an 8 X 80 linear-strain-displacement matrix, [C] is the 
material-stiffness matrix, and det[J] is the Jacobian matrix of the transformation 
from local to natural (dimensionless) coordinates in the element domain. 


Two algorithms were tested for the vectorization of the operations involved in 
equation (Cl). In the first algorithm, microcode was developed for use on the array 
processor to form the upper triangular portion of ]. Vectors of variable 

length had to be used in order to take advantage of the symmetry of the matrix 

This resulted in increasing the overhead (associated with the extra data 
manipulation), with a consequent reduction in the speedup due to vectorization. The 
second algorithm used the library routines of the array processor (see table II) to 
form the full matrix [k ® ], No advantage was taken of the symmetry of this 
matrix. The formation of the full matrix ] using the second algorithm was 

found to be 5 percent slower than the generation of the upper triangular portion 
using the first algorithm. However, if the time required to generate the elements of 
the lower triangular portion of ] during the assembly process is taken into 

account, the second algorithm turns out to be faster and, therefore, it was used in 
the present study. 
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i 


(e) (e) 

12.1.2 Generation of the nonlinear array F .- The nonlinear array F is 
UK UK 

a sparse, completely symmetric, three-dimensional array with dimensions 80 x 80 x 80. 

(e) 

For convenience, F is partitioned into nonzero blocks of 16 x 16 x 16. (See 

UK 


fig. Cl.) The blocks are rearranged so that the central block is in the upper left- 
hand corner as depicted in the right half of figure Cl . Because of symmetry, only 
the corner block and four vertical blocks need to be generated. Moreover, the corner 
block is doubly symmetric (i.e., symmetric in two directions), and the four other 
blocks are symmetric in one direction. 


The array 


UK 


was generated by using the array processor math-library 


routines listed in table II. Each block was formed as a full array. Then, it was 
reduced to symmetric form before being stored on the data base. The speedup ratio 
obtained by using the array processor was 4.3. 


(e) 

12.1.3 Generation of the nonlinear array G .- The four-dimensional array 
_ UKL 

(e) (e) 

G is completely symmetric. Generating G was attempted by using the array- 

UKL UKL 


processor math-library routines, but it failed to improve the performance because of 
the large amount of symmetry in the G array. Data preparation by the host computer 
for the array-processor library routines took longer than the time required by 
the original host routine to form G. Therefore, a microcoded routine had to be 
developed. The speedup ratio obtained by using this routine on the array processor 
was 14.4. 


12.2 Vectorization of Newton-Raphson Iterative Technique 

The recurrence relations for the Newton-Raphson iterative technique can be 
written in the following form: 




.(r) 


(r) 


(C3) 


and 




(C4) 


where } is the correction to the nodal-displacement vector {x^^^ } obtained 

r * (r ) T 

from the rth iteration cycle. The elements of the matrix [K J and the vector 
{f^^^ } are given by 


^ ^JK \ ^IJKL \ \ 


(C5) 
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and 


■ ('^ij + i 


''iJK >‘k ^ 3 °ICKL 


K \j 


’‘j - 


An error norm e is introduced as follows: 


® =i 


, (r) ,Tf (r) , 

(ax } {ax } 


, (r+1) ,T r (r+1) , 

(x } (x } 


(C6) 


(C7) 


r 


where N is the total number of degrees of freedom. The iteration is terminated 
when either e is less than a prescribed tolerance or the number of iterations 
reaches a preselected number. For the first load increment, the initial estimate of 
{x} is chosen to be 


q= 


(C8) 


The global-stiffness matrix [k ^ ] in equation (C3) was partitioned by rows 
and columns into blocks which can be regarded as elements of a hypermatrix; this is 
analogous to the way that numbers are the elements of a matrix. An address or 
pointer matrix is used to identify the location of the nonzero submatrices in the 
hypermatrix. Zero submatrices are identified by a zero entry in the address matrix 
and are neither built nor stored. The vectors } and are parti- 

tioned into subvectors which are consistent with the hypermatrix blocks. 

The vectorization of the Newton-Raphson iterative technique can be conveniently 
divided into three phases; 

(1) Vectorization of the evaluation of the right-hand side of equation (C6) 

(2) Vectorization of the solution of equation (C3) by using the Cholesky method 

with hypermatrices 

(3) Vectorization of the solution update and convergence check of equa- 

tions (C4), (C7), and (C8) 

(e) (e) 

The first phase can be accomplished by forming the full arrays F and G 
for the individual elements from their symmetric portions and contracting them with 
the nodal displacement {x^^^ }. The details of the latter two phases are discussed 
subsequently • 


12.3 Vectorization of Cholesky* s Method With Hypermatrices 
12.3.1 Brief summary of the basic method .- The Cholesky method for solution of 
the global equations [k]{x} = {q} is expressed by the three hypermatrix equations: 
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Decomposition; 


[k] = tUl'^tD] [U] (C9) 

Forward reduction; 

tU]'^{Y} = {q} (C10) 

Back substitution: 

[D][U]{X> = {Y} (C11) 


where [U] is an upper triangular hypermatrix and [D] is a diagonal hypermatrix 
(block diagonal matrix). Ihe vector {y} is obtained by a forward reduction of the 
load vectors, and the displacement vector {x} is then calculated by back 
subs titution. 

12.3.2 Organization of the hypermatrix computation .- In the Cholesky method, 
modified for the hypermatrix approach, the submatrices play the role of the elements 
in the ordinary (scalar) Cholesky method. This requires six basic operations on 
submatrices to be performed repeatedly in the solution process. (See eqs. (C9) to 
(C11)*.) These are as follows: 

(1) Decomposition of a square positive-definite symmetric submatrix: [A] 

(2) Product of an inverse of a square matrix and a rectangular matrix: [A] [e] 


(3) Difference between a matrix and a product of three matrices of the form 

tH] - [E]'*’[A][E] 

(4) Difference between a matrix and a product of two matrices of the 

form [H] - [E]'^[S] 

(5) Product of an inverse of a square matrix and a vector: [A] ^ {Q} 

(6) Difference between a vector and a product of a matrix and a vector of 

the form 


(a) {q} - [E] {r} 

and 

(b) {Q} - [E]'^{r} 
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where [A] represents any of the diagonal submatrices in the hypermatrix 
[K]; [h] is a symmetric submatrix; [E] and [s] represent any of the off- 

diagonal submatrices; and {Q} represents a vector. In operation (2) above the 
inverse is not formed explicitly. Rather, [A] is decomposed and the product 
[A]”^ [E] is formed by forward reduction and back substitution. The six basic 
operations above, (1) to (6), will, henceforth, be referred to as macrooperations . 

The first four macrooperations above, (1) to (4), are used in the decomposition 
phase, and the last two above, (5) and (6), are used in the forward- reduction/back- 
substitution phases (forward/back solve) of the solution process. The vectorization ? 

of the three macrooperations (3), (4), and (6) was accomplished by using the math- 
library routines of the array processor. On the other hand, the vectorization of the 
macrooperations (1), (2), and (6) required the development of a microcoded routine ♦ 

since no appropriate library routines are available to date for performing these 
tasks. The gain in speed obtained by using the array processor for each of the 
aforementioned macrooperations is given in table CIII. Note that the gain in speed 
in the operation {q} - [E]'^{r} is considerably less than that for {q} - [E] {r} 
because of the slow matrix transposition on the array processor. 


12.4 Vectorization of Solution Update and Convergence Check 

The operations involved in the solution update and convergence check of the 
Newton-Raphson technique are described by equations (C4), (C7), and (C8). The 
vectorization of these operations was accomplished by using the three math-library 
routines on the array processor: namely, vector addition (eq, (C4)), sum of the 

squares of the vector elements (eq. (C7)), and scalar times a vector (eq. (C8)). 

In the vectorization process the full vector {AX} was constructed by using its 
subvector partitions obtained from the Cholesky hypermatrix-solution process in each 
iteration. In spite of the associated storage penalty with the formation of the full 
vector {AX}, considerable improvement in the performance was obtained because of the 
reduction in the number of disk accesses and the number of calls to the array 
processor. 


12.5 Vectorization of Evaluation of Basis Vectors and Generation 

of Reduced Arrays 

The explicit form of the recurrence relations for^the basis vectors is given in 
reference 15, As mentioned before, the same matrix [k] appears on the left-hand 
side of each of the recurrence relations. Therefore, ^ the evaluation of the basis 
vectors involves: (1) decomposition of the matrix [k] once, (2) evaluation of the 

right-hand side of each of the recurrence relations, and (3) forward reduction/back 
substitution to generate each basis vector. Vectorization of the operations involved 
in (1) and (3) has already been discussed in the preceding sections. The evaluation 
of the right-hand side of each of the recurrence relations involves contraction of 
the arrays and with nodal-displacement vector and its various-order 

path derivatives. (See ref. 15.) This is accomplished by converting the arrays into 
full matrices and performing matrix-vector multiplication. 
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The procedure outlined in reference 1 5 was adopted in the present study for 
efficiently generating the reduced arrays K . . , F. , and G. by taking 

advantage of the operations performed to evaluate the right-hand sides. 


12.6 Vectorization of Strain-Energy and Stress-Resultant 
Computation (Post Processor) 

The stress resultants and strain energy can be expressed in the following 
compact form: 


{ 0 }^ = [C] } + tc]{i(x^®^)} (Cl 2) 

and 

u= ^ * 5 (x'"M) (C13) 

Elements 


where {or}j^ 
stiffness matrix; 


is the vector of stress resultants at node 

{x^®^ } 


i; [C] is the material- 
is the vector of nodal displacements for the element; 


[B]^ is the linear-strain-displacement matrix evaluated at node i; (b(x ^ )} is 
the nonlinear-strain-displacement vector evaluated at node i; U is ^e total 
strain energy, and u(x^®^) represents the contribution of the cubic and quartic 
terms in the nodal displacements to the strain energy of an individual element. 


The vectorization of the operations described in equations (Cl 2) and (Cl 3) was 
accomplished by using the math-library routines listed in table CII. To increase the 
gain in speed obtained by vectorization, the subvectors of nodal displacements and 
stress resultants were strung up into long vectors. The same was done for the 
matrix [B] and the vector {b(x^^ )}. The larger arrays could still fit into one 
segment of memory. 


12.7 Comments on Vectorization 

The following comments regarding vectorization of the numerical algorithms on 
the array processor seem to be in order: 

(1) User-developed microcoded routines are more efficient to use than the array- 
processor math-library routines when (a) specialized algorithms are used, such as the 
Cholesky hypermatrix algorithm discussed in the preceding sections, and/or (b) vector 
operations involve sparse or symmetric arrays. On the other hand, the AP math- 
library routines are generally more efficient for vector operations involving full 
nonsymmetric matrices. 

(2) Data reorganization (required prior to performing vector operations) was 
found to be slow on the array processor. An example of this is provided by the 
matrix transpose operation. (See table CIII.) 
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TABLE Cl.- TYPES OF DATA STRUCTURES USED IN VECTORIZATION OF NONLINEAR 

FINITE-ELEMENT COMPUTATIONS 




Type 

Array 

Description 

Data structure used 
in vectorization 

A 


Pull, symmetric, 2-D array 

(80 X 80) 

Expand to full square matrix 

B 


Large, completely symmetric, 
3-D array (80 x 80 x 80) 

Array treated as full non- 
zero blocks of size 

16 X 16 X 16 

c 


Large, completely sym- 
metric 4-D array 
(16 X 16 X 16 X 16) 

Expand array as if it were 
symmetric with reference to 
first three indices only 

D 

j 

Global [K] 

Large, sparse, symmetric, 
square matrix 

Matrix partitioned into square 
submatrices ( Cholesky 
hypermatrix blocks) 

E 

Partitions of ], 

stress resultants, 
strain, or dis- 
placement vectors 

Small submatrix or vector 

Combine a number of submatrices 
(or subvectors) into a larger 
matrix (or vector) 


TABLE CII.- ARRAY-PROCESSOR MATH-LIBRARY ROUTINES USED 
IN PRESENT STUDY^ 


FPS library 
subroutine function 

AP Assembler 
library code 

FPS library 
subroutine name 

Matrix multiply 

1 

MMUL 

Matrix transpose 

2 

MTRANS 

Vector addition 

3 

VADD 

Vector subtraction 

4 

VSUB 

Vector scalar multiply 

5 

VSMUL 

Sum of vector elements squared 

6 

SVESQ 


^Routines described in reference 8. 
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TABLE cm.- SPEEDUP RATIOS OF PRIME 750/AP-120B FOR VECTOR OPERATIONS 


Routine 

Block no. 
in figure B1 

Computation^ 

Linear-element- 
stiffness matrix 

1 

= x: W [B]'^[C][B] 
J?f1 

(e) 

Generate F 

UK 

( e) 

Generate G 

IJKL 

1 

1 

See reference 14 
See reference 14 

Cholesky*s method 
with hypermatrices 

3,4 

Macrooperation (1 ) 
Macrooperation (2) 
Macrooperation (3 ) 
Macrooperation (4) 
Macrooperation ( 5 ) 
Macrooperation (6(a)) 
Macrooperation (6(b) ) 

Solution update and 
convergence check 

14 

{X} * q 

N o 
{X} + {ax} 

Strain energy and 
stress resultant 
computation 

18 

i 

1 

Linear strain energy 
Stress resultant 


Assembler coded Estimated ratio of 
or AP math library Prime 750/AP-120B 


Library 


Library 

Assembler 


Assembler 
Assembler 
Library 
Library 
Assembler 
Library 
Li brary 


Library 

Library 

Library 


Li brary 
Library 


An asterisk * denotes multiplication 
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BUFFERING TECHNIQUES USED FOR MINIMIZING I/O OPERATIONS ON HOST AND 

ARRAY PROCESSOR 

The buffering techniques used for minimizing the I/O operations on the Prime 750 
are described in this appendix* Extension of these techniques to the DEC VAX -*11/780 
are also given* Then, the minimization of I/O operations on the AP-1 2 OB, as well as 
> between the host minicomputer and the array processor, is presented* 


13.1 Buffering Techniques for Host Computers 

13*1 * 1 Description of buffering techniques on Prime 750 * - 

13*1*1*1 Standard FORTRAN binary I/O: When standard FORTRAN binary I/O is used, 

the portion of the code employed to measure the output of 10 000 displacement vectors 
has the following form: 

REAL*4 DISP 
DIMENSION DISP(5) 

CALL SNAP 
DO 10 1=1,10 000 
WRITE(6) DISP 
1 0 CONTINUE 

CALL SHOT ('FORTRAN BINARY,' 14) 

Standard FORTRAN binary I/O uses a system I/O buffer of size- 30 words plus a 
2-byte pointer. The entire buffer is transmitted to disk at the end of each write 
statement* This produces a large output file of 1387 records. (See fig* D1 * ) 

Each Prime disk record equals 220 words (880 bytes) . The total word output equals 
305 140, compared with a user output of 50 000 words* Standard FORTRAN binary I/O 
proves inefficient because of the constant size of the system I/O buffer. 

13.1*1.2 Set FORTRAN system buffer size: Prime system routine "ATTDEV" ("attach 

device") allows the user to set the size of the FORTRAN system I/O buffer. This 
routine is called once outside the "do loop" in the portion of code listed above, 
which sets the system I/O buffer to five words* The use of this technique con- 
siderably reduces the number of records in the output file compared with those 
used by the first technique (250 records as compared with 1387 records). (See 
fig. D1 . ) 

13.1.1.3 System I/O routine: Prime system I/O routine PRWF$$ (position, read, 

and write file) permits direct transfer of data from memory to disk, thus completely 
bypassing the FORTRAN I/O buffer. Only user data are written to disk. No pointers 
or zero filling are used to enlarge the output file as with FORTRAN binary I/O* The 
write statement in the aforementioned FORTRAN is replaced with a call to PRWF$$. The 
use of this buffering technique reduces the CPU time by a factor of 4*5 and increases 
the data rate by a factor of 3.8 from the corresponding values when the second 
technique is used. (See fig. D1.) 
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Although PRWF$$ is a powerful I/O routine, it has the following two limitations: 

(1) It can transmit data only to contiguous memory locations, 

(2) It cannot transmit data across segment boundaries* 

The first limitation is not a serious one since buffers are located in contiguous 
memory locations* 

13*1*1*4 User buffering (32-bit words): This technique involves copying 32-bit 

data items into a user I/O buffer the size of one Prime segment (32 000 words) and 
outputting the buffer to disk when either it is full or the I/O operation is 
completed* Three routines are used for this method: INITBF (Initialize Buffer) 

initializes the output file and sets a pointer to zero; WRITBF (Write to Buffer) 
copies 32-bit data items into the buffer and outputs them to the specified logical 
unit when the buffer is full; and DONEBF (Done with Buffer) flushes the buffer and 
closes the output file* The use of this technique results in significant decreases 
in CPU and I/O times, as well as a dramatic increase in data rate* (See fig* D1 * ) 

The improvement in performance is due to the minimal number of disk accesses required 
by this method* The favorable CPU-I/0 ratio of 2*27 enhances the prospects of CPU- 
I/O overlap within the Prime 750* PRIMOS (Prime Operating System) is capable of 
scheduling parallel processing between the CPU and disk I/O to allow higher system 
throughput. 

13*1.1*5 User buffering (64-bit words): This technique is identical to the 

preceding one except that the data are copied in 64-bit words into the segment-size 
buffer* The data rate improves by a factor of 2 over the preceding technique, and 
the CPU-l/0 ratio becomes most favorable for parallel CPU-I/0 operation on the 
Prime 750* 

13.1*1*6 User buffering all data - system pages buffer: This technique involves 

no user output to disk* The data are copied to a very large user buffer* If the 
buffer cannot fit in central memory, the system pages the data to and from disk as 
they are needed. This technique is used for large arrays produced as intermediate 
results. 

Of the six techniques described, the last one produced the best results for 
total time and data rate. (See fig. D1.) This can be attributed to the absence of 
user-initiated I/O. The technique is restricted, however, for use with data that are 
not needed for permanent disk storage, and it should be used with care because of its 
large central-memory requirements. 

13.1*2 Performance of I/O buffering techniques on the Prime 750 .- To evaluate 
the effectiveness of the buffering techniques described in the preceding section in 
reducing the I/O operations, the techniques were applied to the cylinder problem with 
rectangular cutout (3230 degrees of freedom). Most of the I/O time in the program 
was expended in four processors, namely, generation of basis vectors, decomposition 
of the left-hand side, forward reduction/back substitution, and assembly of the 
elemental stiffness matrices and load vectors. The dramatic reductions in the I/O 
times obtained by using these techniques for the four processors are shown in fig- 
ure D2. The use of these techniques transformed the I/O-bound program to one which 
is CP bound. 

In figure D2, T^ refers to the I/O times obtained by using the third buffering 
technique* The times T 2 and T^ are associated with a buffering technique which 
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is a combination of techniques 3, 5, and 6 outlined in appendix D* The number of 
nodes per hypermatrix block are 12 and 5, corresponding to array sizes of 60 x 60 and 
25 X 25, respectively. Note that the improvement obtained by using the smaller-size 
arrays (5 nodes per hypermatrix block - 25 x 25 arrays) is due to the smaller number 
of zeros stored in the blocks. 

The total (CP + I/O) times expended in 10 different processors for the 
2 cylindrical-shell problems are shown in figures D3 and D4, The 10 processors are 
as follows; (1) evaluation of linear-stiffness arrays (2) evaluation of 

nonlinear-stiffness arrays and (3) assembly of elemental matrices; 

(4) incorporation of boundary conditions ; (5) decomposition of the global-linear- 
stiffness matrix [K]; (6) forward reduction/back substitution; (7) solution update; 
(8) postprocessing (stress resultant and strain-energy computation); (9) generation 
of basis vectors; and (10) solution of reduced equations. Note that three of these 
processors were not vectorized; namely, the assembly, boundary conditions, and 
solution of reduced equations. Also, since the basis vectors were generated at zero 
loading, no convergence check was needed. The percentage of time expended in each of 
the processors and the estimated gain in speed obtained by using the array processor 
are summarized in table IV, 

As can be seen from table IV and figures D3 and D4, most of the solution time is 
spent in the two processors used for the decomposition of the left-hand side and in 
the evaluation of basis vectors and the generation of reduced equations. The use of 
the AP (with minimal host I/O) reduces the total times (CP + I/O) expended in these 
two processors by factors of 46,7 and 17,1 for the pear-shaped cylinder and by 
factors of 59,8 and 38.1 for the cylinder with a cutout. The effect of I/O on total 
gains in speed for the two problems is a reduction in CP gains of 9,2 and 18,1, 
respectively, to total (CP + I/O) gains of 5.2 and 9,9, This difference of a factor 
of two between the CP gain and the I/O gain can be attributed to high I/O in the 
assembly of elemental matrices and generation of basis vectors, 

13,1,3 Application of buffering techniques to DEC VAX-11/780 ,- Five of the six 
buffering techniques outlined above were implemented for the DEC VAX-1 1/780 minicom- 
puter, (See fig, D5,) Although results are not as dramatic as on the Prime 750, 
steady increases in data rate result from use of these techniques. The authors 
believe that these techniques are not machine dependent and would be useful on most 
virtual-memory minicomputer systems. 

Buffering techniques 3 to 6 of figure D5 used the VAX macro routine "#WRITE” to 
perform output to the disk. Subroutines INITBF, WRITBF, and DONEBF had to be written 
in VAX-11 macro language because a FORTRAN callable-system I/O routine was not 
available. 

i 

I 

13.2 Buffering Technique for tYje AP-120B 

The buffering technique presented herein involves table and main-data memory of 
the AP-120B, Data are buffered from main-data memory into table memory to avoid 
interleave problems with main-data memory. The foll/owing example was taken from the 
Cholesky decomposition of the global-stiffness matrix [k], (See block 3 in 
fig, B1 • ) The FORTRAN code used in the decomposition of the diagonal blocks and the 
execution trace of this code are 


43 



APPENDIX D 


FORTRAN Code: 


DO 40 1=2, N 
IM1 = 1-1 
DO 20 L=1, IMl 

SUM = SUM + A(L,I) * A(L,I) * A(L,L) 

20 CONTINUE 
40 CONTINUE 

As an example, consider the 5x5 symmetric matrix [A] whose elements occupy 
the following locations in memory: 



1 

6 

1 1 

16 

21 


2 

7 

1 2 

17 

22 

[A] = 

3 

8 

13 

18 

23 


4 

9 

14 

19 

24 


_5 

10 

15 

20 

25_ 



IMl 




A(L, 



A(L,L) 


Execution trace = 2 

1 

1 



6 



1 


3 

2 

1, 

2 


11, 

1 2 


1, 7 


4 

3 

1, 

2, 

3 

16, 

17, 

18 

1, 7, 

1 3 

5 

4 

1 , 

2, 

3, 4 

21, 

22, 

23, 24 

1, 7, 

13 


To perform the multiplications indicated in the FORTRAN program as vector 
operations, the pairs of elements A(L,I) and A(L,L) need to be fetched from main- 
data memory. Consider, as an example, the case where 1=3 and L = 1 . The ele- 
ment A(L,I) in location 11 is fetched from memory, and then the element A(L,L) in 
location 1 is fetched from memory. Since both elements are in odd memory locations, 
they reside in the same memory bank. Therefore, the access of the second element 
results in a spin operation. The execution trace shown above indicates that spin 
operations are certain to occur frequently. Each spin operation adds 167 ns to the 
loop times required to produce one result in a vector operation. 

The spin operations are a direct consequence of interleaving in main-data 
memory. Since no interleaving is involved in accessing the table memory, the 
aforementioned problem can be overcome by using the table memory as a buffer for the 
main-data memory, that is, treating the two memories as a two-level virtual-memory 
system on the array processor. Only the three hypermatrix blocks needed by the 
Cholesky algorithm at any one time are kept in table memory. The main-data memory 
is then used as a mass-storage area wherein a large number of hypermatrix blocks 
reside. All computational modules are made to reference table memory only, thereby 
taking advantage of its faster access speed. The transfer of blocks between main- 
data memory and table memory is accomplished by using the math-library routines of 
the AP, 
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13,3 Minimizing I/O Between Host Computer and Array Processor 

On the user-software level, the following two techniques can be used to reduce 
host-system overhead; 

(1) Minimizing the number of calls to the array processor: This can be achieved 

by the following: 

(a) Merging smaller arrays (e.g*, vectors and matrices) into larger ones: If 

a number of short vectors are combined into a single long vector, only one data- 
transmission call to the AP will be needed. This may require a change in I/O block 
sizes and/or a change in the numerical-solution algorithm being used. 

(b) Chaining several array-processor routines: When a number of array-processor 

subroutines are needed, it is more efficient to call these subroutines from a single 
main array-processor routine instead of calling them separately from the host FORTRAN 
program. Several routines can now be sent to the array processor in one call if the 
program-source memory is large enough. Program-source memory ranges in size from 512 
to 3096 words with 256-word increments. 

(2) Loading the most frequently used routines first: This is because the array- 

processor executive (APEX) uses a last-in, first-out technique in allocating memory 
space for the source program. 
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Figure D1 Performance of buffering techniques on the Prime 750 to output 10 000 vectors to disk 
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Figure D2,- I/O reductions obtained by using buffering techniques on Prime 750, Cylindrical shell 

with rectangular cutout shown in figure 10. 
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Figure D4.- Estimated total speedup ratios (CPU + I/O) obtained by using minicomputer/array-processor 
system. Cylindrical shell with rectangular cutout is shown in figure 7. 
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Figure d 5,- Performance of buffering techniques on the DEC VAX-1 1/780 to output 10 000 vectors to disk 
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TABLE I.- SAMPLE TIMINGS OF VECTOR- SOFTWARE OPERATIONS FOR LIBRARY ROUTINES 

DEVELOPED FOR THE AP-1 20B 


[Library routines developed by Floating Point Systems, Inc, 


data taken from ref. 8. 



Time , jos 
(b) 


Startup + Loop time 


Sum of vector elements 




0.833 + 0.333n 


Sum of squares of vector elements I ^ A? 


1 .333 + 0.333n 


Vector add {a} + {b} 


2.667 + O.SOOn 


Vector multiply (term by term) {a} * {b} 


2.667 + 0.500n 


Vector scalar multiply I A * {b} 


2.667 + 0.500n 


Vector scalar multiply and constant vector add . . 


A * {B} + {C} 


3.333 + 0.500n 


Dot product {a} * {b} 


3.333 + O.SOOn 


Vector multiply and add (term by term) {a} * {b} + {C} 


3.333 + 0.833n 


Two vector multiplies and one vector add ( {a} * {b}) + ( {c} * {d}) 


4.167 + 0.833n 


Vector divide (term by term) {a}/{b} 


4.167 + 1 .667n 


An asterisk * denotes multiplication. 

The symbol n denotes the length of vector 
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TABLE III.- 


HARDWARE AND SOFTWARE FACTORS AFFECTING I/O OPERATIONS 
IN MINICOMPUTER/ARRAY-PROCESSOR SYSTEM 


Computer 
I/O level 

Host computer 

(cache memory, central memory, 
and disk) 

Array processor 
(main-data and table memories) 

Host/AP interface 

Hardware 

• Cache-memory size 

• Central -memory size 

• Memory's access speed 

• Main-data-memory size and 

speed 

• Table-memory size 

• Main-data-memory interleave 

• Single global data base 

• Use of peripheral storage 
device with AP 

System 

software 

• Resource management and 

allocation 

• Paging algorithm between 

• Central memory and disk 

• Cache and central memories 

• AP system I/O routines used 
in transferring data 
between table and main- 
data memories 

• Host-operating- system response 
time to setup transfer from 
host to AP by APEX routines 

User 

software 

• Buffering techniques 

between central memory 
and disk 

• Buffering techniques between 
table and main-data 
memories 

• Minimize number of calls to AP 
. Merging small data sets 

• Chaining AP routines 

• Load most frequently used 

routines first 

• Overlap data transfer and 

AP execution 




TABLE IV.- TIMING DISTRIBUTION FOR TWO BENCHMARK PROBLEMS 
USED IN PRESENT STUDY 





Time, percent 


Speedup ratic 

) 


Block no. 

CPU 

CPU + 

I/O 

CPU 


CPU + 

I/O 

Routine 

in fig. B1 

Problem 

Problem 

Problem 

Problem 



1 

2 

1 

2 

1 

2 

1 

2 

Generate form independent ] 

1 

6.2 

8.2 

5.7 

7.7 

6.4 

20.6 

6.9 

20.5 

f e 1 

Generate form independent ' and 

G^®^ arrays 

1 

3.6 

4.5 

3.4 

4.4 

7.5 

6.5 

7.3 

6.3 

Assembly 

2 

4.5 

2.4 

4.5 

2.6 

1 .0 

1 .0 

1 .0 

1 .0 

Incorporation of boundary conditions .... 

2 

<0.01 

<0.001 

0.7 

<0.001 

1 .0 

1 .0 

1 .0 

1 .0 

Decompose [K] 

3 

36.1 

44.8 

34.6 

43.4 

18.7 

28.0 

46.7 

59.8 

Forward reduction, back substitution .... 

4 

4.9 

4.9 

5.1 

4.9 

7.0 

18.7 

45.9 

51 .1 

Convergence check and solution update . . . 

14 

<0.001 

<0.001 

<0.001 

<0.001 

4.0 

1 .7 

1 7.0 

19.4 

Post processor 

18 

2.9 

0.5 

2.6 

0.5 

2 .2 

1 .9 

2 .2 

2.1 

Generate basis vectors and reduced 

equations 

15 

41 .6 

34.6 

43.1 

36.5 

5.8 

8.5 

17.1 

38.1 

Solution of reduced equations 

8 

<0.01 

<0.001 

<0.01 

<0.001 

1 .0 

1 .0 

1 .0 

1 .0 


Total 


9.2 


18.1 


5.2 


9.9 
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Figure K- Schematic drawing of organization of minicomputer/array-processor system. 
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Figure 4.- User interface with minicompu ter/array-processor system. 
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Figure 5.- Effect of vector length on performance of AP-120B for data already 

available on the AP. 
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Figure 8«- Estimated CPU speedup ratios obtained by using minicomputer/array-processor system. 

Pear-shaped cylindrical shell is shown in figure 6, 
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Figure 9,- Estimated CPU speedup ratios obtained by using minicomputer/array-processor system. 
Cylindrical shell with rectangular cutout is shown in figure 7. 
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